We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardwareaware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 20% compared to MobileNetV2. MobileNetV3-Small is 6.6% more accurate compared to a MobileNetV2 model with comparable latency. MobileNetV3-Large detection is over 25% faster at roughly the same accuracy as Mo-bileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 34% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
translated by 谷歌翻译
A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. That's what it boiled down to: in the era of big data, we face new challenges in causal inference with observational data, i.e., the extensibility for incrementally available observational data, the adaptability for extra domain adaptation problem except for the imbalance between treatment and control groups, and the accessibility for an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we will discuss future research directions on this topic.
translated by 谷歌翻译
Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In the previous work, CampNet (context-aware mask prediction network) is proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding emotional effect to the editing speech during the text-based speech editing to make the generated speech more expressive. To achieve this task, we propose Emo-CampNet (emotion CampNet), which can provide the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of generated speech by introducing additional emotion attributes based on the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered by the emotional components in the original speech, a neutral content generator is proposed to remove the emotion from the original speech, which is optimized by the generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which can enable the model to edit the unseen speaker's speech. The experimental results that 1) Emo-CampNet can effectively control the emotion of the generated speech in the process of text-based speech editing; And can edit unseen speakers' speech. 2) Detailed ablation experiments further prove the effectiveness of emotional selectivity and data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
translated by 谷歌翻译
In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy(M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to the model variation, thus improving the success rate for attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. Our codes will be released.
translated by 谷歌翻译
Event-based simulations of Spiking Neural Networks (SNNs) are fast and accurate. However, they are rarely used in the context of event-based gradient descent because their implementations on GPUs are difficult. Discretization with the forward Euler method is instead often used with gradient descent techniques but has the disadvantage of being computationally expensive. Moreover, the lack of precision of discretized simulations can create mismatches between the simulated models and analog neuromorphic hardware. In this work, we propose a new exact error-backpropagation through spikes method for SNNs, extending Fast \& Deep to multiple spikes per neuron. We show that our method can be efficiently implemented on GPUs in a fully event-based manner, making it fast to compute and precise enough for analog neuromorphic hardware. Compared to the original Fast \& Deep and the current state-of-the-art event-based gradient-descent algorithms, we demonstrate increased performance on several benchmark datasets with both feedforward and convolutional SNNs. In particular, we show that multi-spike SNNs can have advantages over single-spike networks in terms of convergence, sparsity, classification latency and sensitivity to the dead neuron problem.
translated by 谷歌翻译
The error Backpropagation algorithm (BP) is a key method for training deep neural networks. While performant, it is also resource-demanding in terms of computation, memory usage and energy. This makes it unsuitable for online learning on edge devices that require a high processing rate and low energy consumption. More importantly, BP does not take advantage of the parallelism and local characteristics offered by dedicated neural processors. There is therefore a demand for alternative algorithms to BP that could improve the latency, memory requirements, and energy footprint of neural networks on hardware. In this work, we propose a novel method based on Direct Feedback Alignment (DFA) which uses Forward-Mode Automatic Differentiation to estimate backpropagation paths and learn feedback connections in an online manner. We experimentally show that Directional DFA achieves performances that are closer to BP than other feedback methods on several benchmark datasets and architectures while benefiting from the locality and parallelization characteristics of DFA. Moreover, we show that, unlike other feedback learning algorithms, our method provides stable learning for convolution layers.
translated by 谷歌翻译
Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts the segmentation performance, especially for rare object categories. Although diverse, high-quality object instances used in Copy-Paste result in more performance gain, previous works utilize object instances either from human-annotated instance segmentation datasets or rendered from 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images or zero-shot recognition model to filter noisily crawled images for different object categories is a feasible way to make Copy-Paste truly scalable. To make such success happen, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it archives +2.6 box AP and +2.1 mask AP gains on all classes and even more significant gains with +6.8 box AP +6.5 mask AP on long-tail classes.
translated by 谷歌翻译
Traditional learning-based approaches to student modeling (e.g., predicting grades based on measured activities) generalize poorly to underrepresented/minority student groups due to biases in data availability. In this paper, we propose a Multi-Layer Personalized Federated Learning (MLPFL) methodology which optimizes inference accuracy over different layers of student grouping criteria, such as by course and by demographic subgroups within each course. In our approach, personalized models for individual student subgroups are derived from a global model, which is trained in a distributed fashion via meta-gradient updates that account for subgroup heterogeneity while preserving modeling commonalities that exist across the full dataset. To evaluate our methodology, we consider case studies of two popular downstream student modeling tasks, knowledge tracing and outcome prediction, which leverage multiple modalities of student behavior (e.g., visits to lecture videos and participation on forums) in model training. Experiments on three real-world datasets from online courses demonstrate that our approach obtains substantial improvements over existing student modeling baselines in terms of increasing the average and decreasing the variance of prediction quality across different student subgroups. Visual analysis of the resulting students' knowledge state embeddings confirm that our personalization methodology extracts activity patterns which cluster into different student subgroups, consistent with the performance enhancements we obtain over the baselines.
translated by 谷歌翻译
The tradeoff between performance and inference speed is critical for practical applications. Architecture reparameterization obtains better tradeoffs and it is becoming an increasingly popular ingredient in modern convolutional neural networks. Nonetheless, its quantization performance is usually too poor to deploy (e.g. more than 20% top-1 accuracy drop on ImageNet) when INT8 inference is desired. In this paper, we dive into the underlying mechanism of this failure, where the original design inevitably enlarges quantization error. We propose a simple, robust, and effective remedy to have a quantization-friendly structure that also enjoys reparameterization benefits. Our method greatly bridges the gap between INT8 and FP32 accuracy for RepVGG. Without bells and whistles, the top-1 accuracy drop on ImageNet is reduced within 2\% by standard post-training quantization.
translated by 谷歌翻译
Task allocation is an important problem for robot swarms to solve, allowing agents to use reduce task completion time by performing tasks in a distributed fashion. Existing task allocation algorithms often assume prior knowledge of task location and demand or fail to consider the effects of the geometric distribution of tasks on the completion time and communication cost of the algorithms. In this paper, we examine an environment where agents must explore and discover tasks with positive demand and successfully assign themselves to complete all such tasks. We propose two new task allocation algorithms for initially unknown environments -- one based on N-site selection and the other on virtual pheromones. We analyze each algorithm separately and also evaluate the effectiveness of the two algorithms in dense vs. sparse task distributions. Compared to the Levy walk, which has been theorized to be optimal for foraging, our virtual pheromone inspired algorithm is much faster in sparse to medium task densities but is communication and agent intensive. Our site selection inspired algorithm also outperforms Levy walk in sparse task densities and is a less resource-intensive option than our virtual pheromone algorithm for this case. Because the performance of both algorithms relative to random walk is dependent on task density, our results shed light on how task density is important in choosing a task allocation algorithm in initially unknown environments.
translated by 谷歌翻译